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Abstract 

We study the problem of the generation of a continuous random variable when a source 
of independent fair coins is available. We first motivate the choice of a natural criterion for 
measuring accuracy, the Wasserstein L^ metric, and then show a universal lower bound for 
the expected number of required fair coins as a function of the accuracy. In the case of an 
absolutely continuous random variable with finite differential entropy, several algorithms are 
presented that match the lower bound up to a constant, which can be eliminated by generating 
random variables in batches. 

Keywords: random number generation, random bit model, differential entropy, partition en¬ 
tropy, inversion, probability integral transform, tree-based algorithms, random sampling 


1 Introduction 

Knuth and Yao [lj showed that the expected number of independent Bernoulli (1/2) random bits 
needed to generate an integer-valued random variable X whose distribution is given by pi = P{Y = z}, 
where J2iPi = P is at least equal to the binary entropy of X: 

e = s(x) = £<fe}‘:;) ='^ R iog 2 (-). (i) 

They also exhibited an algorithm—dubbed the DDG tree algorithm—for which the expected num¬ 
ber of random Bernoulli (1/2) bits is not more than £ + 2. By grouping, one can thus develop 
algorithms for generating batches of m independent copies of X such that the expected number of 
Bernoulli (1/2) random bits per random variable does not exceed £ + 2/m. 

While these results settle the discrete random variate case quite satisfactorily, the generation 
of continuous or mixed random variables has not been treated satisfactorily in the literature. The 
objective of this note is to study the number of Bernoulli (1/2) random bits to generate a random 
variable X e M. d with a given precision e > 0, provided that we can define “precision” in a 
satisfactory manner. 

Note that any algorithm that takes as input the accuracy parameter e > 0, returns a random 
variable 

Y e = f T (B 1 ,...,B T ,e) 

where B\,B 2 ,... ,Bt are independent identically distributed (or i.i.d.) Bernoulli (1/2) bits, T is 
the number of bits needed, and ■ ■ are given sequences of functions. For a vector v € M d , 
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let ||u|| p denote the £ p -norm of v for p > 1: ||v|| p = (Ya =i \ v i\ p ') p ■ For p = oo, the oo-norm is 
Halloo = sup 1<i<d |uj|. With d = 1, all p-norms are the same for p E [1, oo]. 

An algorithm with accuracy e is such that for some coupling of X (the target random variable) and 
Y e , 

\\X-Y e ||p<e, 

where p is usually 2 or oo. This natural notion of accuracy corresponds to the Wasserstein L ^ metric 
between two probability measures. For a random variable X, we denote by jC(X) the distribution of 
X. Let M denote the space of all distributions of pairs of random variables (X, Y) E x with 
fixed marginal distributions F and G, respectively. Then the Wasserstein distance between X 
and Y, or between F and G, is 

W P (F,G) = inf {ess sup || X- Y || p : (X,Y)eM}, 


where ess sup denotes the essential supremum. This is a distance metric: 


dist(AT, Y) = W P (F, G). 


If 


dist(X, Y) < e 


then there exists a random variable Y (output) coupled with X (target) such that ess sup ||X — Y || p < e, 
i.e., with probability one, 11AT — Y|| p < e. This definition of distance should satisfy simulation pro¬ 
fessionals in the sense that if their calculations require the evaluation of 


^(ATi,... ,AT rf ), 


where X \,... ,X f ] are given independent random variables and f is a real-valued functions, then, 
with probability one, 

\V(Y 1 ,...,Y d )-y(X 1 ,...,X d )\< sup | *(y 1 ,...,y d )-*(X 1 ,...,X d )\ 

\\y-X\\ p <e 

which is usually a quantity that can easily be controlled. 

We believe that software packages should have the capability of accepting e as input parameter in 
random variate generation. It is interesting that E(T), the expected value of T, can be related to the 
entropy almost in the way Knuth and Yao did for discrete random variables in 11]. Our note provides 
the foundational background for such a study in terms of universal lower bounds and various 
useful upper bounds for particular algorithms. We include examples for the main distributions. 
Several authors have adressed the problem of arbitrary precision in sampling algorithms. These 
include Flajolet and Saheb |2|, who explain how to generate the first k bits of an exponential 
random variable for an integer k > 1. Karney |3| describes an algorithm for the standard normal 
distribution. Lumbroso’s thesis [4| also discusses arbitrary precision sampling. 

The quantity that appears in our lower bound is the partition entropy. More precisely, let A 
be a partition of and let e > 0 be a fixed parameter for the precision. The partition entropy of 
X with respect to A is the quantity 
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While our results apply to all distributions, we will mainly focus on absolutely continuous 
distributions, i.e., random variables X with density /. We recall the definition of differential 
entropy: 

£(/) '= [ f(x) log 2 ( —^ dx. 

JxeR d \f( x )J 

The differential entropy can be ill-defined, —oo, finite or +oo. For more information on differential 
entropy and entropy in general, one can read Cover and Thomas |5|. When X has compact 
support, then the case +oo cannot occur. When / is bounded, then the case —oo is excluded. 

When £{\fX\\, ..., [_ Xd\ ) < oo, it can be shown that £{f) is well-defined and either finite or —oo; 
see Renyi [6], Csiszar (7j| for a proof. 

Our main result is the following: 

Theorem. Let X G be a target random vector with density f, and assume that £([X±\ ,..., pCiJ) < oo. 
Consider any algorithm that for given e > 0 outputs Y = Y e using T random fair coins, such that 
W P (X, Y) < e. Then 

E IT) >£(/) + dlo g2 (T) - log 2 ■ 

For p = 1 and p = oo, the third term in the lower bound is log 2 (2 rf /d!) and d, respectively. For 
d = 1, it is 1. 

The second part of this note describes several algorithms that come within a constant term of 
this lower bound, and therefore, are basically optimal if grouping is used for generation. Before 
tackling all that, we introduce a brief section in which we recall the main exact sampling algorithms 
for discrete random variables, and their properties, as these will be essential for the understanding 
of the main algorithms. 

2 Bounds for the discrete case 

In this section, we give simple proofs of two important results for generating discrete random 
variables—the optimal algorithm of Knuth and Yao jl| and the more practical but slightly subop- 
tirnal algorithm of Han and Hoshi 81. We recall that we want to sample X with probability vector 
(pi,P2, ■ ■ ■)■ Every pi is decomposed into its binary representation 

00 b‘ ‘ 

Pi = Yl where b h] G {0, 1} for all i,j. 

3 =1 

Consider the new random variable Z with probability vector 


/& 1,1 

hi,2 

&2,1 

&2,2 

\ def ( 

\ 

V 2 1 ’ 


'' ’ ~2^~' 

’ ' 

■, ) = Pi, 2, 

• ,P2,1,P2,2, ■ ■ ■ , J 


with pij = bij/2 j . 

Any algorithm for generating a discrete random variable using a source of i.i.d. fair coins B\, 
B 2 , ... and that is based on a stopping time T —when it returns an output—can be viewed as 
a binary tree, where B 1 , B 2 , ... uniquely determines an infinite path in the tree by the rule “0” 
is left and “1” is right. We refer to this general class of algorithm as tree-based algorithms—they 
include all possible practical algorithms. Leaves in this tree correspond to outputs. The algorithm 
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of Knuth and Yao can be implemented by a binary tree, a DDG tree, in which each leaf at level 
j corresponds to a “1” bit bij in (2). One randomly walks down this tree starting at the root and 
reaches the leaf for b t] = 1 with probability 1/2- 7 =Vij • At that point, the value “z” is returned, 
and indeed, 

OO 

P{Y = %} = ^ ~2pij =pi , 

3 = 1 

as required. A tree-based algorithm is optimal if it minimizes E(T) for a given probability vector. 

Theorem 1 (Knuth and Yao |l|). The expected number of bits of an optimum tree-based algorithm 
for sampling (pi,P 2 , ■ ■ ■) is bounded from below by and from above by £({pi}fTi) + 2. 

Proof of Theorem^ 7} Given a probability vector (pi,P 2 , ■ ■ ■ ,p n ) with n possibly infinite, let the 
binary expansion of pi for i £ {1,..., n} be 

00 b- 

= and bi ’ j € {°’ 1 j‘- 

3 = 1 

If T denotes the number of bits required by an optimal algorithm for sampling (pi,P 2 , ■ ■ ■ ,Pn) , then 
for t > 1 

„ , number of leaves at level t 

= =- 2 ‘ - 

n u 

E Oi,t 

~ 2 P ' 

i =1 

Therefore, 


E(T) = ^tP{T = f} 

t =0 


oo n , 

t=i i =i 



(3) 


We now show that the quantity within parentheses of line ([3]) is lower bounded by pi log 2 (l /pi) and 
upper bounded by pjlog 2 (l/pj) + 2 pi and then the result follows. For convenience, let x £ [0,1] 
and its binary expansion be 


OO 



3 = 1 


To complete the proof, it remains to prove that 


xlog 2 




(4) 


If m is the first non-zero coefficient of the binary expansion of x, then there are two cases: either 
(1) x = 2~ m or (2) 2~ m < x < 2 1-m . The inequalities are strict for case 2 since x 2~ m . For the 
first case, x m = 1, and line is obviously true. For the second case, 

1 2 , . , 

or ’ 0<m + lo g2 W < L 
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Then for the upper bound, 


for the lower bound, 


oc . oc 



j =1 j=m +1 


m + 1 
— 2 m 

< x(m + 1) 

< x(2 — log 2 (x)) , and 


TG _ sr' J x j 

2-^ 2? ^ 2? 

3 =1 j=m 


oo 





= mx 

> x (— log 2 (^))- 


□ 

We now recall the Han-Hoshi algorithm published in [8| that implements the inversion method. 
Given (p \,... ,p n ) with n countably hnite or infinite, the algorithm partitions the interval [0,1] into 
a countable collection of disjoint subintervals [Qi-i,Qi) where 

Q o = 0, and 

i 

Qi = £>*, ie{l,...,n}. 

k= 1 

The idea behind the algorithm is to iteratively refine a random interval I C [0,1) and to stop 
when / C [Qi~i,Qi) for a certain i £ {1,... , n}. The inversion principle says that if U uniformly 
distributed on [0,1], then the unique i £ {l,...,n} such that Qi < U < Qi+i is distributed 
according to (p\,... ,p n ). For a binary random source of unbiased i.i.d. bits, their algorithm is as 
follow: 

Algorithm 1 The Han-Hoshi algorithm using a binary source 
1: T i — 0 

2: CtT 0 

3: fa 1 

4: repeat 

5: T £- T + 1 

6: B <— Random Bit 

7: ay •$— qt_i + Q3t—i — aT~i)(B/2) 

8: /3t <- a T _i + (/3t-i — Q!t-i)((-B + l)/2) 

9: I [«T, /?r) 

10: until / C [Qi-i,Qi) 

11: Return i. 
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The following two figures [l] and [2] are examples that show the underlying DDG tree during the 
execution of the Han and Hoshi algorithm. 



Figure 1: Illustration of the algorithm of Han and Hoshi on the vector ( Pi,P2,P3,P4,P5,P6,P7) = 
(tTi) T 7 ;, A, s). The cumulative values are q\ = = (0.00010)2, Q 2 = Zj = (0.00111)2, 

q 3 = ¥= ( 0 . 01100 ) 2 , g 4 = § = (0.10101) 2 , q 5 = § = (0.11011) 2 , q& = § = (0.11100) 2 , and 
q 7 = § = (1.00000) 2 . 



Figure 2: Illustration of the Han and Hoshi algorithm on the vector (pi,P2,P3,Pi) such that p\ = 
0.0001 • • ■, Pi + P 2 = 0.0101 • • •, and Pi + P 2 + P 3 = 0.1011 


Let T be the number of random coins needed and also the number of iteration of the “repeat” 
loop. For T > 1, [a^/dr) 3 \ut+ i,Pt+i)- To every node (internal or external) corresponds an 
interval The root corresponding to the interval [0,1). For each internal node corresponds 

an interval \olt,(5t) that is not contained in one of the interval [Qi,Qi + 1 ), and, if the source 
produces B = 0, then the left child corresponds to the interval [ay, (a-p + fir)/ 2)) = [ar+i, Pt+i) 
and, if B = 1, then the right child corresponds to [(err + /?t)/2),/3t) = [&t+i, Pt+i) • Each leaf 
(external node) corresponds to an interval [cut,/3t) entirely contained in \Qi,Qi+ 1) upon which 
the integer i is returned with probability Qi. 

The following theorem was proved by Han and Hoshi [8j: 
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Theorem 2. For the Han-Hoshi algorithm, 


\ / i \ X 

Y Pi l0g 2 (~) < E ( T ) <Y Pi l0g 2 

,_1 \PI/ i=l 


2= 1 


-)+3. 
Pi, 


Proof of Theorem [I| Our (new) proof partitions the leaves Li for symbol i in the DDG tree arbi¬ 
trarily into two sets, Ai and such that Ai and Bi each possess at least one leaf per level. Let 
a i = Ylu&A, p( u )> Pi = J2ueBi p( u )i where p(u ) is the probability attached to leaf u, i.e. l/2 depth (“). 
We have pi = a* + fa. By nesting and elementary calculations, 

X>i ° g2 (f) i £,<***, (f) + EAloft (I 


2=1 


Pi 


2—1 

OO 


s § ftlog2 W + L 


2—1 


Pi 


(5) 


Let ( af)j be the j-th bit in the binary expansion of a*, and let (Pi)j be the j-tli bit for fa. Then 

E( T) = f;i^ + f;i^ d ^ f i+iL 

3 =1 “ 3=1 

As in the proof of Theorem [lj we have 

Y ai log 2 ( < I < jr at log 2 ( + 2 a h 


2—1 

OO 


2 — 1 


2—1 

OO 


Y log 2 ( J. ) < 11 <YPi !og2 ( ^ ) + 2 pi, 


2=1 


Pi 


so 


that, using ([5]), 


Pi log 2 ( — ) < E(T) < Y^Pi l° g 2 ( ~ ) +! + 2 Y Pi 

i =1 VPlJ t=l VPl/ 

= f>io g2 (-)+3. 


2=1 


2=1 




3 Lower bound for generating continuous random vectors 


In this section, we give a lower bound for the complexity of sampling any continuous distribution 
to an arbitrary precision. Let A be a countable partition of M d , and let e > 0 be a fixed precision 
parameter. Consider the infinite graph G with as vertices the sets A € A, and as edges all pairs 
(A, B) e A x A such that 


inf 

x&A,y&A 


x - y\\ p < e. 


Therefore, if (A, B) is not an edge of G, then \\x — y\\ p > e for all x £ A, y e B. Let A be the 
maximal degree of any vertex of G. We now state a lemma that we shall use in conjunction with 
the Knuth-Yao result, mentioned and reproved in the previous section, in order to prove our main 
theorem mentioned in the introduction. 
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Lemma 1. Let X be a target random vector ofM. d . Let Y be an output with the property that, with 
probability one, \\X — Y\\ p < e. Let T be the number of bits used to generate Y by an algorithm. 
Then 

E(T) > sup {£ a (X) - log 2 (A + 1)}, 

,4 

where A and A are as above. 

We can maximize the bound from Lemma [lj of course, by selecting the most advantageous 
partition A and A combination. The bound from Lemma [I] coincides with the bound in Shannon |9| 
when the distribution X is discrete with a finite number of atoms since, in that case, A = 0 by 
choosing e sufficiently small. 

Proof of Lemma [1} Let X and 
Pab = P{X E A,Y E B}. Note 

\£a(X) - £ A (Y)\ = 


< 


< 


If T is the random number of bits needed to generate a discrete random variable Y that outputs 
a vertex A of G with probability P{Y E A}, then 

E(T) > £ a (Y) by Knuth-Yao, 

>£U(Y)-log 2 (A + l), 


Y be two (dependent) random variables of M rf , and denote by 
that pab = 0 if {A, B ) is not an edge of G. Thus 

(A,B)eAxA \ 1 J / 

(by Jensen’s inequality) 

( P{X E A, Y E B} 


log 2 

log 2 

log 2 


E 

. ( A,B)gAxA 


P {Y E B} 


y P{Y E A} 

ep { ^^fE p{ p { ^r > 


.B£A 


. AeA 


^P{YeB} (A + l) 
WbgA ) / 

log 2 (A + 1). 


( 6 ) 


and therefore 


E(T) > sup {£ a {X) - log 2 (A + 1)}. 
A 


□ 

It is interesting to recall a general result from Csiszar Jr] about the hypercube partition entropy 
of an absolutely continuous random vector X that will become useful later. Of particular interest 
to us is the cubic partition A\ partitioned by h > 0. The cells of this partition are of the form 

d 

[ijh, (ij + l)h ), (*i, ..., i d ) E Z d . 
i=i 
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def 


We recall that if [X\ = (L-XiJ,..., [X^J) has finite entropy—a condition we refer to as Renyi’s 
condition—then if X has a density /, 


f(/) = //log 2 (±) 


is well-defined, i.e., it is either finite or —oo. We have 

Lemma 2. Under Renyi’s condition, for general partition A, and random variable X E with 
density f, 

£a(X) > £(f) +^P{IG A} log 2 ( 1 
AgA 

where A(-) denotes the Lebesgue measure. In particular, 


A {A)J* 


£ A* h (X) > £(f)+dlog 2 Q 
Proof of Lemma\^ Fix A € A. If Z is uniform on A and Y = f(Z), then 

P {X E A} = [ f = X(A)B(Y). 

JA 

Thus 


P{X E A} 

HA) 


log 2 


A(A) 


P{X E A} 


= E(y)io g2 


1 


E(T) 


and, by Jensen’s inequality and the concavity of xlog 2 (1/a;), 


E(T) log 2 


E (Y) 


> e( yio g2 ( p 


i 


A(A) 


/ log 2 ( j 


The inequality follows by summing over A E A. 


□ 


Lemma 3 (Csiszar |7|). Let X E have density f, and let Renyi’s condition be satisfied. If 
£(f) > —oo, then as h \. 0, 


If £(f) = —oo, then as h f 0, 


&AI (X) — £(f) + d log 2 ( — ) + o(l). 


£At(X)-d log 2 ( - 


—oo. 


Remark 1. The fifth theorem of Csiszar ^ stipulates that if£([Xi \,..., < 00 an d f is not 

absolutely continuous, then, as h —>■ 0, 


£ai(X) - d log 2 ( - 


—oo. 


For more information about the asymptotic theory for the entropy of partitioned distributions 
as the partitions become finer, one can consult Renyi |6], Csiszar [7], Csiszar |10|, and Linder and 
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Theorem 3. Let X E M rf have density f. Let Y be an output with the property that with probability 
one, ||X — Y\\ p < e. Then, under Renyi’s condition, 

E(T) > £(f) + dlog 2 Q) - log 2 V diP , 

where 

2 d T(l- 
Vd ' p ' ?(*+ 

is the volume of the unit ball in and T is the number of random bits needed to generate to Y. 
Proof of Theorem [3| Let A* h be a cubic partition. Then 

E(T) > sup fs A *(X) - log 2 ( A h + 1) 

h \ 

where A h is the maximal degree in the graph on At x At defined by connecting A E A* t with 
B E A* h if inf x&A ,y&B ||— y\\ p < e. We set h = e/n and use 

E(T) > limsup (s A * (X) - log 2 (A e/n + l) Y 

n—>■ oo \ e '" / 

Observe that if B r denotes the Lp-ball of radius r centered at 0, then by elementary geometric 
considerations, 

< A < 

h d ~ h ~ h d 

so that as n —)> oo, 

A e/n ~ Q A (B c ) = V d)P n d . 

Also, 

S A* Jn {x) >£(f) + d\og 2 Q 

so that 



£ A* e/ jx) ~ log 2 (A e /„ + 1) 

> £•(/) + ()) + io& ( 1 + Vjj „"( 1 + t , (1)) ) 

n ^°£(f) + d log 2 (^j -\og 2 V d , p . 


4 Upp er bound for partition-based algorithms 

Consider a random variable X E We call a partition A e a e-partition if for every set A E A e , 
there exists x A E A (called the center) such that 

sup \\x A - y Up < e. 
y&A 
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Then any algorithm that selects A G A e with probability p(A) = P{ A G A} = f A f can be used 
to generate a random variable Y that approximate X to within e. After generating A , set 

Y = x a . 

Then, necessarily, there is a coupling (X, Y) with ||X — Y || p < e. If the selection of A is done with 
the help of the method of Knuth and Yao, then, if T still denotes the number of random bits 
required, 

E(T) < S At {X) + 2. 

For p = oo, we can take A e = A 2e , the cubic partition with sides 2e. For d = 1, a simple partition 
into intervals of length 2 e can be used for all values of p. 

If X has a density / and p = oo or d = 1, then the procedure suggested above has, as e 4- 0, 

E(T)<£ a *JX) + 2 

< £(/) + d\og 2 ^ + 2 + o(l), 

= £(/) + d\og 2 + 2 — d + o(l), 

where in the last step we assume Renyi’s condition and £(/) > —oo. Compare with the lower bound 

E(T) > £(f) + dlo g 2 (^) -d, 

and note that the difference is 2 + o(l). 

For later reference, we recall these values of £(f ) for the main distributions: 

Uniform[0,1]: £{f) = 0, 

Exponential(l): £ (/) = log 2 (e), 

Normal(0,1): £(f) = log 2 V2ne. 

Recall that for X G M 1 , a > 0, a scale factor a shows up as log 2 (a) in the upper and lower bounds 
because 

£{aX) = £(X) + log 2 (a). 

For general pG [1, oo), we can take 


A e — A i, 

2 e/dP 

l 

the cubic partition with sides 2 e/dp. Under Renyi’s condition and £(f) > —oo, we have 
E(T) < £ 4 * x + 2 

2 e/dP 

< dlog 2 +£(f) + 2 - d + ^log 2 (d) + o(l). 

The difference with the lower bound is 

D = 2 + — log 2 (d) + dlog 2 r + 1^ — log 2 r + 1^ + o(l). 
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Using r(l + u) > [u/e) u \Z2 ttu, u > 0, we obtain 

D < 2 + dlog 2 (YQ + l^j(ep)p^j - * log 2 ^2tY^ +o(1), 

which unfortunately increases linearly with d. To avoid this growing differential—which we did not 
have for p = oo—it seems necessary to consider partitions that better approximate £ p -balls. 


5 Upp er bound for inversion-based algorithms 


The inversion method for generating a random variable X with distribution function F uses the 
property that 

X = F~ l (U) 

has distribution function F. where i 7-1 denotes the inverse, and U is uniform [0,1]. One can use 
this method as a basis for generating an approximation using only a few random bits. In particular, 
if 

Uj 

u = o.u 1 u,- = '£,^t. 

3=1 

and Ui, U r 2 , ■ ■ . are independent Bernoulli(l/2) random variables, then setting 


(t) 


then 

l 

Note that C/( 0 ) = 0, = 0.1111 ••• = !. 

Graphically, we have 


O.Ui 

■■■Ut, 

O.Ui 

■■■Ut + Jt 

O.Ui 


\i) ^ 

U <U+. 



Figure 3: Inversion method illustrated 
The number of random coins is 

T = rninp >0 : F' 1 (U+) - F~ l < 2e}. 
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If we define 


„ F-‘(C/+)+F-i(t/ (t) ) 

2 

then X and Y are coupled in such a way that 

\X - Y\ < e. 

The T defined above is also the number of bits needed to generate Y. 

Observe that the inversion method requires F in a “black box”, also called an oracle. On 
the other hand, it avoids the cumbersome calculation of the cell probabilities and the set-up of the 
Knuth-Yao DDG tree, and thus shines by its simplicity. 

In spirit, the inversion method mimics the method of Han and Hoshi, and indeed, this observa¬ 
tion leads to a simple bound. Let A* 2( be a cubic partition of M 1 into intervals of equal length 2e. 
Denote the probabilities of these intervals by 

P (A) = P{XeA}, A e A% t . 

Assume that we select an interval from A 2e following this law by the method of Han and Hoshi 
using random bits Ui, U 2 , U 3 , ... also used in the inversion method. It is easy to see that the 
number of bits needed before halting in the inversion method is smaller. Therefore, for the inversion 
method, we have 

E(T)<£ AL (X) + 3. 

From this and Lemma [3j we conclude 

Theorem 4. If X has a density f satisfying Renyi’s condition and if f /log 2 (l//) > —oo, then 
as e | 0 , 

E(T) < log 2 + //log, 0^+2 + o(l). 

Remark 2. Comparing Theorem^ with the universal lower bound 

E(T) > log, 0) +£(/)- 1, 
we see that the difference is 3 + o(l). 

Remark 3. The partition-based method has un upper for E(T) that is one less. Moreover, the 
simplicity of the inversion method cannot be underestimated. In addition, one can tighten the 
analysis under additional conditions on f such unimodality, monotonicity, or for specific forms. 

Theorem 5. Assume that X has a bounded nonincreasing density f on [0, 00). Then for the 
inversion method, as e f 0, 


E(T) < log 2 0^ + j f log, +o(l). 

Proof of Theorem 0 Define X t = F X+ = F as in Figure |3j Then 

OO 

E(T) = ^P{A+-A t >2e} 

4=0 
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oo 

<£p{/(x+)<^} 

t =o 

^because \ t = \ x f- f( x t)( x t ~ -*t)j 

OO 1 OO 

5 E p {/m < 2^} + E p {/W) < ** < /(*>} 

t=o t =o 

= 1 + 11. 


Now, 

ise { 1+1oS2 ++)} 

= l°g 2 (j)+// log 2 (i) 

(even if the latter integral is infinite). The Theorem follows if we can show that II = o(l). To this 
end, note that 

oo 1 

II <E p {/(+)< 5 = </(*<)}• 

t =o 

For a fixed value of t, we see that f(X^) < 7 ^: E f( x t) only if X falls in the interval that 
“captures” the value if such an interval exists. But the probability of each interval is precisely 
1/2*. However, if 7^1 > /(0), then no such interval exists. Thus, 


If 

t =0 

< 4e/(0) = o(l). 


2e/(0) 


IH 

For the uniform [0,1] density, we have £(/) = 0, and so the bound of Theorem [ 5 ] is 

E(T) < log 2 Q +o(l). 

However the “o(l)” can be omitted in this case as the following simple calculation shows: 

00 

E(T) = ^P{T>f} 
t =0 

00 2 t —l 

= EE 

t =0 i=0 
oo 2 t —l 

= EE 

t=0 z—0 
00 

“ X] ^{*<log 2 (^)} 

t=0 
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Theorem [5] improves over the bound for the partition method (for d = 1) by 1 + o(l). Under other 
regularity conditions, one can hope to obtain similar bounds that beat the partition bound. 

For the exponential density, the inversion method yields 


E(T) < log 2 
= log 2 



+ £(/) + °( 1 ) 

+ log 2 (e) + o(l), 


where log 2 (e) = 1.443... Flajolet and Saheb j 2 | proposed a method for the exponential law that 
has 

E{T} = log 2 + 5.4 + <p(e), 

where |</j(e)| < 0.2 as e 0 . 

For the normal law, Karney [3] proposes a method that addresses the variable approximation 
issue but does not offer explicit bounds. Inversion would yield 


E {T} = log 2 



+ log 2 V2-ire + o(l), 


but the drawback is that this requires the presence of (an oracle for) F 1-1 , the inverse gaussian 
distribution function. Even the partition method requires a nontrivial oracle, namely F. To 
sidestep this, one can use a slightly more expensive method based on the Box-Miiller jl2], which 
states that the pair of random variables 

(V2EV 1 ,V2EV 2 ) 

with E exponential and (Iq, V 2 ) uniform on the unit circle, provides a standard gaussian in M 2 of 
zero mean and unit covariance matrix. The random variable \/‘2E is Maxwell, i.e., it has density 
re~ r / 2 , r > 0 , and its differential entropy is 

/Maxwell (r) log 2 ( -- , \ ) dr 

\ /Maxwell U)/ 

= si2 G ( 7 ■ iog(2> ) + 0 

= 1.359068 ..., 


^(/Maxwell) — 


where 7 = 0.577215 ... is the Euler-Mascheroni constant. 

We sketch the procedure, which also serves as an example for more complicated random variate 
generation problems. Assume that the two normals are required with e-accuracy each (this corre¬ 
sponds to the choice of d = 2 and p = 00). Then we first generate a Maxwell random variable M 
by inversion, noting that 

F{r) = 1 - e 2 , 
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F 1 (u) = \J—2 log(l - u). 


The Maxwell random variable M is needed with |-accuracy. The Maxwell law is unimodal with 
mode at r = 1. Its left piece has probability 1 — So we first pick a piece randomly using on 
average no more than two bits. The we apply inversion on the appropriate piece. By Theorem [5j 
we use T\ random bits where 

E(Xi) < log 2 + £ (/Maxwell) + 2 + o(l). 

The generated approximation is called M'. 

Next we generate a uniform random variable U E [0, 27t) with accuracy 

e/2 

M' + (e/2)' 

The generated value U' E [0, 2ir) has \U — U'\ < m'+U/ 2 ) ■ Since U has differential entropy log 2 (27r), 
we see that the expected number of bits, T 2 , needed is bounded by 


E(T 2 ) <E(^log 2 
< E ( log 2 ( 


(M' + (e/ 2) 


V e/2 

(M+ (e/2) 


+ log 2 (27r) + o( 1) 
+ log 2 (27r) + o(l) 


Then we return 
and claim that jointly, 


V e/2 

= lo S 2 0^ + E( log 2 (M)) + log 2 (2vr) + o( 1) 
(by the dominated convergence theorem). 

(M'sm(U') , M'cos{U')), 


\M'sm(U') - M sin(Z7)| < e 
\M' cos(U') — M cos(C/)| < e. 


To see this, note that 


sm(U') — sin(f7)| < \U-U’\ < 


e/2 


M' + (e/2) 


and similarly for the cosine. Next, 


\M' sin(C/ / ) — M sin(C/)| < |M 7 - Af|| sin(f7')| + M\ sm(U') - sin(C/)| 

< | M’ - M\+M\U' - U | 

e e/2 

< - + M - — 

~ 2 M' + (e/2) 

< e e 
“ 2 + 2 
= e. 
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Putting everything together, we see that the total expected number of bits is not more than 


E(Ti) + E(T 2 ) < 21og 2 
= 2 log 2 



+ E( log 2 (M)) + £( /Maxwell) + 2 + log 2 (27r) + o(l) 
+ 2 + log 2 (27re) + o(l) 


= 2 log 2 ( - ) + 6.094191... + o(l). 


The lower bound for generating two independent gaussians is 4 + o(l) less, i.e., 


2 log 2 


+ 2 log 2 V^ne — 2 = 2 log 2 


+ 2.094191... 


6 Batch generation 

6.1 Randomness extraction 

Turning a sequence of i.i.d. random variables X\, X 2 , ■ ■ ■ into a sequence of i.i.d. Bernoulli (1/2) 
bits has been the subject of many papers. The setting of interest to us is the following. Let F\, T 2 , 
... be a (possibly infinite) number of cumulative distributions supported on the positive integers. 
Let (pi,p 2 -...) be a fixed probability vector. Let X , X\, X 2 , ... be i.i.d. random integers drawn 
from (pi,P 2 , ■ ■ •)• Given (X±, X 2 ,..., X n ) for n G N, draw (Yj, Y 2 ,..., Y n ) independently from the 
distributions F Xl , F X2 , ..., F Xn . 

As a special case, we have the classical setting when p\ = 1 and then Y \, 1 2 , ..., Y n are i.i.d. 
according to F\. 

Let F \, F' 2 , ... have binary entropies given by £\, £ 2 , • • •, all assumed to be finite. In other 
words the entropy of Y \ X = i is denoted by £ t . Assume also that 

OO 

n def n 

C = 2 _^Vi£i < 
i =1 

Theorem 6. There exits an algorithm (described, below) that, upon input (X\,Y{), (A 2 , Y 2 ), 
lyX n ,Y n ) outputs a sequence of R n i.i.d. Bernoulli (1/2) bits where 

R n P, c 

- > t as n —^ 00 . 

n 

Furthermore, these bits are independent of (X\,... ,X n ). 

Theorem [6] describes how many perfect random bits we can extract from Y\, Y 2 , .. ■, Y n , i.e., 
R n should be near the information content, the entropy of (Yi,1 2 , ... Y n ). Not surprisingly then, 
the way to achieve this can be inspired by the optimal or near-optimal methods of compression, 
and, in particular, arithmetic coding. 

Note that one can assume Y t = F x) (Vi) for i G {1,..., n} where Vj, V 2 ,.. ■, V n are i.i.d. uniform 
random variables on [0,1]. For all i G N, let the binary expansions of pi be 

XI , 

Pi V 2 ' where b l3 G {0,1}. 

3 = 1 
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„ . def 

for convenience, pij = 


"*3 

2J 


for all (i,j) G N x N. Also, for all i G N, let 


Fi(j) 


0 if j = 0, 

pT Ei=i Vik if J > 1- 


Following the methodology of arithmetic coding, associate a uniform[0,1] random variable U 
(with binary expansion O.C/ 1 C /2 • • •) with the (infinite) data sequence (Xi, hi),..., (X n , Y n ). The 
bits Ui , i > 0, are i.i.d. Bernoulli(l/2) and independent of Xi, X 2 , X 3 ,... To be more precise, 
consider this algorithm. 


Algorithm 2 Randomness extraction 

Input: A sequence of pairs (Xi,Yi), (X n ,Y n ) with Xe and Ye as described previously for 
t G {1,... ,n}. 

1: Uq <- 0 

2: t/ 0 + <- 1 

3: for t = 1 to n do 

4: ^ 4- t/ £ -, + (U+ , - Ui_ t )F X ,(Y, - 1) 

5: [// <- UJ _, + (t/+, - U'Z.jFx.m 

6: end for 

7: i? n •(— max {t > 0 : [2 t U-\ = [Fu+W { R n is the number of bits of the longest prefix 

common to both U~ and [/+.} 

8: return [2 Rn U ~J 


To verify the correctness of Algorithm [ 2 J the intervals [Ug , U ^] are nested. More precisely, for 
all L 

Pi,U +] 2 lU^.U+i. 

Define U = limsup^^ U~ = lim inf n ^oc U+, and note 


Ue[U-,U+] = f) [Uf,U+). 

1=1 

For every iteration, we have 

U-U- r , 

—x-— = Uniform)!/- ]. 

U+-UJ 

Since R n = rnaxjt > 0 : )2 i !/“J = |_2 t L/+J}, 

<ui<u< ui < + 1 ). 

The bits C/i, U 2 , ..., t/ft n are clearly i.i.d. 


Proof of Theorem Let f Gl and consider the two cases {R n > t} and {R n < t}. We show that 
t is concentrated around n£. Before considering the two cases, we compute the useful quantity 




Pij 


OO OO OO 

= Y J Pi^g 2 (Pi) + Y.Y, Pij log 2 

i =1 i =1 j =1 
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= -£{X) 
= -S{X) 

def^ 

Note that 

{R n > t) C 


The pairs (Xi, Yi),..., (X n . Y n ) are i.i.d. and therefore, by the previous calculation, 

E M^)H 

By the law of large numbers, for all e > 0, P{i? n >n(£ + e)} —>■ 0 as n—»oo. We have 

{R n <t}c{[2 t U+\>[2 t U-\}. 

By the law of large numbers, for all e > 0, if t = n{£ — e), 

For an arbitrary fixed integer k > 0, 

p {Rn < t + k} < p|c/+ - U- > + p|c/+ - U- < l2*U+\ > [2*U~\ } 

- °( 1 ) + 

which is as small as desired by the choice of k. The 2/2 k term is due to the fact that the event 
{[/+ - U~ < 2 tysr, [2*17 +J > } occurs only if \U - and m G N. □ 

6.2 Batch generation algorithm based on a general DDG tree algorithm 

Assume given a random variable X G N with fixed probability vector (pi,p r 2 , • • •) of finite entropy 
denoted by £(X). Assume that we employ a given DDG tree based algorithm for the generation 
of X. In this tree, let L be the set of leaves and let label(u) be the label of leaf u G L. Define 

Li = {« G L : label(u) = i} for i G N. 

Let d(u) be the depth of u G L. Then we have 

p{x = i}« K =£ Aj. 

u(E.Li 




+ 


i=i j =i 


Pi 


Pi 


+ Pi<?i + £ (X) 


i —1 


u:-u-<- t } 

n 

TT PX(Y ( 


X>g 2 (^-) 


> t >. 


(7) 
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If the algorithm returns a variable X, then we know that we must have exited via a leaf in Lx- 
Given X = i, each exit-leaf has a given probability: 


P|Exit via leaf u G L*} =-. 

Pi 

Let us call the (random) exit-leaf Y. The DDG tree algorithm thus returns a pair ( X , Y). We have 

OO OO 11 

( y I * = *) = X] -^y l°g 2 (p*2 d(u) ) 

2=1 2=1 U^zLi 1 

oo 1 OO 

log 2 (2 d(u) ) + log 2 (w) 

2=1 UELi 2=1 

-Ess'-*. (**>)-wo 

ilEL 

= £(Y)-£(X). 

For example, for the Knuth-Yao DDG algorithm, we have £(Y) — £(X) < 2, while for the Han- 
Hoshi algorithm, we have £(Y) — £(X) < 3. Our method of batch generation will be valid for any 
DDG tree with finite £(Y). 

The algorithm for batch generating i.i.d. random variables X\, X 2 , ..., X n uses an atomic 
operation FetchBit that first gets a bit from a queue Q if the this queue is not empty, and 
otherwise it gets a bit from a Bernoulli (1/2) generator. It is understood that that FetchBit drives 
the DDG tree algorithm. 


Algorithm 3 Batch generation 
1: {Initially, the queue is empty.} 

2: Rq 0 {There is no “recycled” bit initially.} 

3: for i = 1 to n do 

4: Generate (W, 1}) by a DDG tree algorithm. {The DDG algorithm uses the operation 

FetchBit to get bits either from the source or from the queue Q.} 

5: return X t 

6: Feed (X,;, Y,) to the retrieval algorithm (randomness extraction procedure), and recover Ri — 

Ri—i bits which are added to Q. 

7: end for 


Theorem 7. The batch generation algorithm uses N n random bits, where 

X n V r-/ v \ 

- > c (A ) asn-4 oo, 

n 

whenever £(Y) < oo. 

Remark 4. By the Knuth-Yao lower bound, 

£(N n ) > n£(X), 

and therefore, the procedure is asymptotically optimal to within o p (n) bits. The symbol o p (n) means 
it is o{n ) in probability as n —> oo. 
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Proof of Theorem [?| We choose a large integer constant k and look at N n k . Let Qt be the size of 
the queue at time t, and set Qo = 0. For j £ {1,..., nk}, let Tj be the number of bits needed for 
generating Xj without extraction. The Tj are i.i.d. random variables. Then we have the following 
simple identity: 


nk 


N n k — ^ ^ Tj Rnk H” Qnk • 
3 = 1 


By the law of large numbers, 


T\ + T2 + ... + T n k 

E(7\ + T 2 + ... + T n ]f) 


1 as n —> 00. 


We note that E(Ti + T2 + ... + T n k ) = nk£(Y ) because random variables Tj are i.i.d. By Theorem 
we also have that ( R n k/nk ) A £(Y) — £(X) as n —> 00. Therefore, 

Nnk = £(Y) + op(l) - {£{Y) - £{X)) + o p (l) + Qnk 


nk 


nk 


— £ ( X) + o p (l) + 


Qnk 

nk 


The result follows if ( Q n k/nk ) A 0 as n —> 00. For this, we need only consider an upper bound, 
since Q n k > 0, and then 

Qnj "£ Qnj —1 T {Rnj ~ ^n(j— 1)) min Qn(j—1)}- 

±SjSk 

Since ( R n /n ) A £(Y) — £(X), we have 


max 
1 <j<k 


Rnj R-n(j —\) 


n 


(£{Y)-£{} 0) 


( 8 ) 


and 


max 

l <j<k 


T n j T n (j_ r) 


n 


~£(Y) 


0 . 


(9) 


Fix e 0, and let .4 be the event that both lefthand sides in (|Sj) and © are less than e, so that 
P{A C } = o(l). The critical observation is that on A, 

[ QnU- 1 ) + {£(Y) - £(X) + e)n- {£(¥) - e)n 
Qnj < l if Q n (j- 1) > {£(X) - e)n, 

[ Qn(j-i) + (^(E) — £(X) + e)n — else. 

< max {Q n(i _D, (T(T) - £(*) + e)}, if 2e < T(X), 

and therefore, 


max < (£ (E) - £(X) + e) 
l<j<k 


n 


and 


Qnk K £(Y) - £(X) + e 
nk ~ k 
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If we choose k large enough such that (( £{Y ) — £{X) + e)/k) < e, then 


□ 

If batch generation is applied to the partition method for continuous distributions, and the 
Knuth-Yao or Han-Hoshi method are used for the discrete part of that method, then the expected 
number of random bits needed per random variable, under Renyi’s condition and £(/) > — oo, is 
bounded from above by 

dlog 2 +£(f)-d + o(l), 

thus matching the lower bound for the case p = oo, and all dimensions. 

7 Conclusion and outlook 

Using the notion of maximal coupling between a generated random variable and a theoretical target 
random variable, we were able to lay the groundwork for a theory of generating with universal lower 
bound in terms of the number of random Bernoulli 1/2) bits needed. In order to grace the world’s 
software libraries with variable accuracy generators, much more work is needed, both algorithmic 
and theoretic. We have shown that the well-known inversion method is nearly optimal in an 
information-theoretic sense, and will submit further work and other paradigms in the near future. 


Appendix: Generation of an exponential by convolution 


The method given in this appendix is based on taking the sum of independent random variables. 
Using convolution does not require an oracle for F or F -1 which were required by the algorithms 
given in this paper. Its range of applicability seems however restricted to the uniform and the 


exponential as suggested by Kakutani’s result 13 that can be stated as follows: 


Theorem 8 (Kakutani (1948)). For all i E N, let pi E [0,1] and let Xi be independent Bernoulli 
random variables such that P {Xi = 1 } = Pi- If X = X{2~ 1 , then 


OO 

X is singular <=> E l Pi 

i =1 
oo 


diverges, 


/ i \ z 

X is absolutely continuous ipi~ ~J converges, 


2=1 

OO 


X is discrete ]^[ ( - + 


2=1 


Pi 


> 0 . 


First of all, as shown in 14 , if X is an exponential mean 1 random variable, then [X\ is 


distributed as a geometric random variable with parameter 1/e, and {Y}, the fractional part of X, 
is distributed as a truncated exponential random variable on the interval [0,1), and [YJ and {Y"} 
are independent. We concentrate on the fractional part therefore. The following theorem tells us 
that the fractional part is the convolution of independent Bernoulli random variables. 
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Theorem 9. Let {X \...., Xj, .. .) be a sequence of independent Bernoulli distributed random vari¬ 
ables with 


P{Xj = 2 J '} = Pj £ [0,1], and 
P{Xj = 0} = 1 — pj for all j G N. 


Let X = Y^jLiXj. If 


then X is a truncated exponential random variable, 


-1/2-J 


ie., 


f/ie p.d.f. of X is 


Proof of Theorem [5| Since 


the Fourier transform of Xj is 


fx 0*0 = 7 -^-T for x G [0,1]. 

1 — e 1 

g - 1 / 2 - 7 

^ = e-V^' + l’ 


E(e lXjt ) = pje lt / 2j + (1 — pj) 

e ((-i+*t)/2*) + x 
= e -1 / 27 + 1 ' 


Since X is the sum of the independent Xds, we have 


E(e* xt ) = 


Y[ P(e lXjt ) 

3 =1 

e ((—l+*t)/2»’) + x 


n 


3 = 1 

X _ e (-l+it) 


e- 1 / 2 ^ + 1 


1 


1 — e _1 — 1 + it’ 


which is the Fourier transform of fx{x). 

We can thus generate {X} with precision e if we set k = |dog 2 (x)], and let 


( 10 ) 

□ 


k k 

Y = ^ Xj = ^ Bernoulli (pj). 

3 =1 j'=i 

Since a raw Bernoulli random variable requires 2 bits on average, this simple method, which has 
accuracy guarantee \Y — {X}| < e, uses an average not more than 2k = 2[dog 2 (x)l bits. On the 
other hand, the lower bound for {X} is 

E(T) > log 2 Q+£({X})-1, 

where 

^■({X}) = —— log 2 (e — 1). 
e — 1 


23 









The factor “2” in 2[log 2 Q)] can be avoided when batch generation is used. It can also be 
eliminated—at a tremendous storage cost— if the vector (2 1 Xi,..., 2 k X k ) is generated using the al¬ 
gorithm of Knuth-Yao since we know the individual probabilities, i.e., P{(2 1 X \ 1 ..., 2 k X k ) = (2 1 xi,..., 2 k x k )} 




3 = 1 


for all {x \,..., Xk) G {0, l} fc . One can show that 


k 

£((2 1 X 1 ,...,2 k X k ))=Y j nXj) 

3 = 1 

= E (» lo & (^) + (1 - ») lo & (rv^)) 

^ . e- 1 , 1 

< l°g2- \~ k + — — . 

e (log(2))2 /c 

Thus, for the Knuth-Yao method for this vector, 


E(T) <££(*,-)+ 2 
i=i 


< 


log 2 ( - 


+ 3 + log 2 


e — 1 


+ o(l). 


Again using Knuth-Yao, a geometric (1/e) random variable can be generated exactly using not more 
than 

(e - 1) log 2 + log 2 (e) + 2 

random bits. An exponential random variable generated by this method has an overall expected 
complexity 

E(r,<log 2 ()) + 7.360698... + »,1) 


as e | 0. 
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